Benchmarking Policies
In this tutorial, we walk through testing the accuracy of classifications made by DynamoGuard against your own dataset in CSV format. This will help you understand how your data interacts with our API and provide insights into improving your model's accuracy.
Prerequisites
Before you start, ensure you have the following:
- DynamoAI Access Token
- DynamoAI API URL
- Dataset in the required format (see the Dataset CSV Format Requirement section)
The API URL can be found in the sending POST requests section of the Code Export for policies.
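If you prefer not to hard-code the access token and API URL, one option is to read them from environment variables. This is only a convenience sketch; the variable names `DYNAMOAI_API_URL` and `DYNAMOAI_API_TOKEN` are illustrative, not a DynamoGuard convention:

```python
import os

# Illustrative only: read credentials from environment variables instead of hard-coding them.
# The variable names below are assumptions, not required by DynamoGuard.
api_url = os.environ.get("DYNAMOAI_API_URL", "")
auth_token = os.environ.get("DYNAMOAI_API_TOKEN", "")
```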
Environment Setup
Ensure you have Python installed on your system, then install the necessary libraries, including `pandas`, `numpy`, `aiohttp`, `sklearn` (installed via the `scikit-learn` package), `datasets`, and any others required by the code.
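If you are unsure whether your environment has everything it needs, a quick import check like the sketch below can help. The package list mirrors the imports used later in this tutorial:

```python
import importlib

# Quick sanity check that the packages used in this tutorial are importable.
# Note: the `sklearn` module is provided by the `scikit-learn` package on PyPI.
for pkg in ("pandas", "numpy", "aiohttp", "sklearn", "datasets"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: missing")
```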
Asynchronous Structure
The benchmark script is designed with an asynchronous structure to enhance efficiency and scalability when sending multiple requests to the Guardrail LLM API. Asynchronous programming allows the script to initiate and manage numerous API calls concurrently, significantly reducing the overall execution time compared to a synchronous approach where each request must be processed sequentially.
This structure is particularly beneficial when dealing with large datasets, as it ensures that the script can handle a high volume of prompts without being bottlenecked by network latency or API response times. By leveraging Python's `asyncio` library and the `aiohttp` client for asynchronous HTTP requests, the script can send out requests, wait for responses, and process the results as they arrive, all in a non-blocking manner.
Utilizing asynchronous programming principles, the `main` function orchestrates the entire process, from reading the dataset and initializing the benchmark instance to executing the accuracy test and saving the results. This approach ensures that the script remains responsive and efficient, even when scaling up to handle extensive datasets or when integrated into larger, more complex systems for automated LLM benchmarking.
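To illustrate the pattern independently of the DynamoGuard API, the minimal sketch below sends several POST requests concurrently with `asyncio.gather`; the URL and payloads are placeholders:

```python
import asyncio
import aiohttp

async def classify(session, url, payload):
    # One request; control is yielded to the event loop while waiting for the response.
    async with session.post(url, json=payload) as resp:
        return await resp.json()

async def classify_all(url, payloads):
    async with aiohttp.ClientSession() as session:
        # All requests are created up front and awaited together, so they run concurrently.
        tasks = [classify(session, url, p) for p in payloads]
        return await asyncio.gather(*tasks)

# Example usage with a placeholder URL:
# results = asyncio.run(classify_all("https://example.com/classify", [{"text": "hello"}, {"text": "world"}]))
```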
Package Import
To begin, import the libraries needed for the benchmarking process:
import os
import time
import numpy as np
import pandas as pd
import asyncio, aiohttp
from datasets import load_dataset
from pprint import pprint
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
Dataset CSV Format Requirement
Your dataset should be in CSV format with at least two columns: Prompt and Label. The Prompt column contains the text to classify, and the Label column contains the ground-truth classification: `safe` or `unsafe`.
Example CSV format:
| Prompt            | Label  |
|-------------------|--------|
| "Example prompt"  | safe   |
| "Another example" | unsafe |
- Prompt: a string containing the text (prompt) to classify.
- Label: a string (`safe` or `unsafe`) giving the ground-truth classification of the prompt.

Note that the column names in your file must match the `input_col` and `label_col` values configured in the benchmark script below.
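Before running the benchmark, it can be worth validating the file. The following is a minimal sketch, assuming a placeholder path `my_dataset.csv` and the column names shown above (adjust them to match your `input_col`/`label_col`):

```python
import pandas as pd

# Placeholder path and column names; adjust to match your dataset.
df = pd.read_csv("my_dataset.csv")

missing = {"Prompt", "Label"} - set(df.columns)
assert not missing, f"Missing required column(s): {missing}"

# Labels must be 'safe' or 'unsafe' (the benchmark lower-cases labels before checking).
invalid = df[~df["Label"].astype(str).str.lower().isin(["safe", "unsafe"])]
assert invalid.empty, f"Found {len(invalid)} rows with invalid labels"

print(f"{len(df)} rows; label counts:\n{df['Label'].str.lower().value_counts()}")
```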
Code Summary
In the initialization phase, we set up the necessary parameters and read the dataset from a CSV file. This step involves specifying the API URL, the policy ID to test, the authentication token for API access, and the filename of the CSV-format dataset described above. The `api_url`, `policy_id`, `model_id`, and `auth_token` arguments used in the code snippet below can be found on the DynamoGuard platform.
First, we define the `DGuardBenchmark` class.
- After loading the dataset, the main function initializes an instance of the `DGuardBenchmark` class with these parameters. This instance is used to send asynchronous requests to the Guardrail LLM API to classify the prompts, test the accuracy of the classifications against the ground-truth labels, generate a report of the results, and finally save the report to a CSV file for further analysis.
- The `send_request` asynchronous method sends a single classification request to the DynamoAI API and captures the prediction. It ensures that the label is valid and handles any errors during the request.
- The `get_predictions` method orchestrates the sending of asynchronous requests for the entire dataset. It gathers the responses and records the predictions alongside the expected labels used for the accuracy calculation.
- Finally, the `report_results` and `save_csv` methods generate a classification report and save the results, including the predictions and their accuracy, to a CSV file.
class DGuardBenchmark:
    def __init__(self, api_url, policy_id, auth_token, model_id=None, saving_title="results"):
        self.api_url = api_url
        # Accept either a single policy ID or a list of policy IDs
        if isinstance(policy_id, str):
            self.policy_id = [policy_id]
        elif isinstance(policy_id, list):
            self.policy_id = policy_id
        self.model_id = model_id
        if 'chat/session_id' in self.api_url:
            assert self.model_id is not None, "Model ID is required for chat/session_id API endpoint"
        self.headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {auth_token}',
        }
        self.results = []
        self.stats = []
        self.preds = []
        self.labels = []
        self.saving_title = saving_title

    async def get_predictions(self, prompts, labels, show_report=False):
        """
        Send classification requests for every prompt in the dataset and record
        the predictions. Populates self.preds, self.labels, and self.stats for
        later reporting.
        """
        tasks = []
        # Collect predictions asynchronously
        for prompt, label in zip(prompts, labels):
            tasks.append(self.send_request(prompt, str(label)))
        responses = await asyncio.gather(*tasks)
        # Process responses
        valid_preds = []
        valid_labels = []
        total_cnter = 0
        for i, response in enumerate(responses):
            input_prompt, ground_truth, pred = response
            self.preds.append(pred)
            self.labels.append(ground_truth)
            self.stats.append((input_prompt, ground_truth, pred))
            # Only include non-error predictions in metrics
            if pred != 'error':
                valid_preds.append(pred)
                valid_labels.append(ground_truth)
                total_cnter += 1
    async def send_request(self, prompt, label):
        """Send a single asynchronous request to the DynamoGuard API."""
        # Sanity check: labels are case-insensitive but must be 'safe' or 'unsafe'
        label = label.lower()
        if label not in ['safe', 'unsafe']:
            raise ValueError(f"Invalid label: {label}")
        async with aiohttp.ClientSession() as session:
            if 'chat/session_id' in self.api_url:
                json_data = {
                    'messages': [{'role': 'user', 'content': prompt}],
                    "modelId": self.model_id
                }
            else:
                json_data = {
                    'messages': [{'role': 'user', 'content': prompt}],
                    "textType": "MODEL_INPUT",
                    "policyIds": self.policy_id
                }
            try:
                # Brief pause to avoid overwhelming the API when many requests are in flight
                await asyncio.sleep(0.1)
                async with session.post(self.api_url, headers=self.headers, json=json_data, ssl=False) as response:
                    response_json = await response.json()
                    if 'chat/session_id' in self.api_url:
                        final_action = response_json['analyses'][0]['finalAction']
                    else:
                        final_action = response_json['finalAction']
                    # A final action of NONE means no policy was triggered, i.e. the prompt is safe
                    if final_action == 'NONE':
                        pred = 'safe'
                    else:
                        pred = 'unsafe'
            except Exception as e:
                try:
                    pprint(f"Request failed: {e}. Response received: ")
                    pprint(response_json)
                except:
                    pprint(f"Request failed: {e}. No response received.")
                pred = 'error'
        return prompt, label, pred
    def report_results(self):
        """Generate the classification report and prepare it for CSV export."""
        # Exclude requests that errored out so the report covers only 'safe'/'unsafe' predictions
        valid = [(lbl, prd) for lbl, prd in zip(self.labels, self.preds) if prd != 'error']
        y_true = [lbl for lbl, _ in valid]
        y_pred = [prd for _, prd in valid]
        # Generate classification report as a dictionary
        report = classification_report(y_true, y_pred, target_names=['safe', 'unsafe'], output_dict=True)
        # Keep accuracy for a dedicated row and drop it from the per-class table
        accuracy = report['accuracy']
        report.pop('accuracy')
        # Convert to DataFrame, transpose for easier manipulation, and reset index to make the index a column
        self.report_df = pd.DataFrame(report).transpose().reset_index().round(2)
        # Rename columns to include the title as the first column
        self.report_df.columns = ['Title'] + self.report_df.columns[1:].tolist()
        # Add an empty column to the report DataFrame for alignment
        self.report_df[' '] = ''
        # Insert a divider row
        divider_index = self.report_df[self.report_df['Title'] == 'weighted avg'].index[0] + 1
        divider_row = pd.DataFrame([['---', np.nan, np.nan, np.nan, np.nan, '']], columns=self.report_df.columns, index=[divider_index])
        self.report_df = pd.concat([self.report_df.iloc[:divider_index], divider_row, self.report_df.iloc[divider_index:]]).reset_index(drop=True)
        # Add one more row for the accuracy
        accuracy_row = pd.DataFrame([['accuracy', np.nan, np.nan, np.nan, f"{accuracy:.2f}", np.nan]], columns=self.report_df.columns)
        self.report_df = pd.concat([self.report_df, accuracy_row]).reset_index(drop=True)
    def save_csv(self):
        """Save the report and results to CSV, with the report at the top-left."""
        # Convert stats to DataFrame
        stats_df = pd.DataFrame(self.stats, columns=["Prompt", "Ground Truth", "Prediction"])
        # Prepare an empty DataFrame to align the report with stats
        empty_rows_needed = max(0, len(stats_df) - len(self.report_df))
        empty_df_for_alignment = pd.DataFrame(np.nan, index=range(empty_rows_needed), columns=self.report_df.columns)
        # Concatenate the empty DataFrame with the report for vertical alignment
        aligned_report_df = pd.concat([self.report_df, empty_df_for_alignment], ignore_index=True)
        # Now concatenate the aligned report with the stats horizontally
        final_df = pd.concat([aligned_report_df, stats_df], axis=1)
        # Save to CSV
        filename = f"{self.saving_title}_{time.strftime('%Y%m%d-%H%M%S')}.csv"
        print(f'*** Saving results to CSV with filename: {filename} ***')
        final_df.to_csv(filename, index=False)
The main function demonstrates this process step by step, ensuring a streamlined workflow from dataset preparation to result analysis.
async def main():
    '''
    api_url: str, the endpoint of the Guardrail LLM API
      - At this moment (Apr. 17, 2024), the endpoint should be either
        `https://api.dynamo.ai/v1/moderation/analyze/` (to specify the policy/policies to test) or
        `https://api.dynamo.ai/v1/moderation/chat/session_id` (to test all deployed policies)
    policy_id: str or list, the policy ID(s) to test against.
      - When specifying policies to benchmark, please use the `/analyze/` endpoint.
      - If policy_id is a list, the benchmark will be conducted for all policies in the list.
    model_id: str, the model ID to test against.
    auth_token: str, authentication token for API access
    filename: str, path to the CSV file (or dataset) containing the benchmark data
    '''
    api_url = "https://api.uat.dynamo.ai/v1/moderation/analyze/"
    policy_id = "<policy-id>"    # The specific policy ID to test against
    model_id = None
    auth_token = "<auth-token>"  # Your DynamoAI access token
    filename = "dynamoai-ml/v2408_FinancialAdvice-benchmark-outguard-data_Meta-Llama-3.1-70B-Instruct-Turbo"
    input_col = 'prompt'
    label_col = 'label'

    # Read prompts from a Hugging Face dataset if possible, otherwise from a local CSV
    try:
        data = load_dataset(filename)['train']
    except:
        data = pd.read_csv(filename)
    try:
        prompts = data[input_col]
        labels = data[label_col]
    except:
        warnings.warn("'prompt' and 'label' extraction led to a key error. Assuming this is an output guardrail dataset with 'response' and 'response_label'")
        input_col = "response"
        label_col = "response_label"
        prompts = data[input_col]
        labels = data[label_col]
    if type(prompts) != list:
        prompts = data[input_col].tolist()
    if type(labels) != list:
        labels = data[label_col].tolist()

    benchmark = DGuardBenchmark(
        api_url=api_url,
        policy_id=policy_id,
        auth_token=auth_token,
        model_id=model_id,
        saving_title=filename.split('/')[-1].split('.')[0] + '_results'
    )
    await benchmark.get_predictions(prompts, labels)
    benchmark.report_results()
    benchmark.save_csv()

await main()
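The trailing `await main()` works in a notebook environment such as Jupyter, where an event loop is already running. In a standalone script, run the coroutine with `asyncio.run` instead:

```python
# Standalone-script equivalent of `await main()`:
if __name__ == "__main__":
    asyncio.run(main())
```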
Metrics
In the results CSV file generated by the Guardrail LLM Benchmark tool, several key metrics are included to evaluate the performance of the classifications. These metrics are essential for understanding how well the model performs across different aspects:
- Precision: The ratio of correctly predicted positive observations to the total predicted positives. It shows how many of the positively predicted cases were actually positive.
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class. It measures the model's ability to capture actual positives.
- F1-score: The harmonic mean of Precision and Recall. The F1-score reflects overall performance and is especially informative when the dataset is imbalanced, for instance when safe prompts greatly outnumber unsafe prompts (or vice versa).
- Support: The number of actual occurrences of the class in the specified dataset. It provides insight into the class imbalance of the dataset.
- Macro Average: The average of the Precision, Recall, and F1-score for each class, without taking the class imbalance into account. It treats all classes equally, calculating metrics for each class independently and then taking the average.
- Weighted Average: Similar to the Macro Average, but each class's metric is weighted by its support. This accounts for class imbalance by giving more weight to the metrics of classes with more instances.
- Accuracy: The proportion of prompts for which DynamoGuard's safe/unsafe prediction matches the ground-truth label.
These metrics together provide a comprehensive overview of the model's performance, highlighting areas of strength and potential improvement. Precision, Recall, and F1-score offer a balanced view of the model's accuracy, while Support indicates the dataset's class distribution, and the averages (Macro and Weighted) give insights into the model's overall effectiveness across different classes.
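If you prefer to compute these metrics programmatically from the benchmark's accumulated predictions rather than reading them from the CSV, the following is a minimal sketch using the scikit-learn functions imported earlier. It assumes `benchmark` is a `DGuardBenchmark` instance after `get_predictions` has run, that no requests errored out, and it treats `unsafe` as the positive class:

```python
# Assumes benchmark.labels and benchmark.preds contain only 'safe'/'unsafe' values.
labels, preds = benchmark.labels, benchmark.preds

print("Accuracy :", accuracy_score(labels, preds))
print("Precision:", precision_score(labels, preds, pos_label="unsafe"))
print("Recall   :", recall_score(labels, preds, pos_label="unsafe"))
print("F1-score :", f1_score(labels, preds, pos_label="unsafe"))

# Confusion matrix with a fixed label order: rows are ground truth, columns are predictions.
tn, fp, fn, tp = confusion_matrix(labels, preds, labels=["safe", "unsafe"]).ravel()
print("False positive rate:", fp / (fp + tn))
print("False negative rate:", fn / (fn + tp))
```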
Conclusion
Following this guide allows you to accurately test the classifications made by the DynamoAI API against your dataset. This process is crucial for understanding the performance of your models and making the necessary adjustments.
Legal Notice Copyright 2024 DynamoAI, Inc. All rights reserved.
This software is provided "as is", without warranty of any kind, express or implied. Use of this software is governed by the Terms of Use between you or your employer and DynamoAI. The above restrictions highlight the prohibition of unauthorized use, distribution, and modification.